library(tidyverse)
Blog Post 1
Description of the papers
Academic articles
Article 1: Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework
Authors: Prashanth Rao and Maite Taboada
Research question
Rao and Taboada (2021) studied how often men and women are quoted in news articles. They compared topics such as sports, lifestyle, business and healthcare to see whether the gender of the people quoted differed by topic.
Data
The data came from the Gender Gap Tracker, which scrapes the websites of seven major English-language Canadian newspapers. The articles covered two years starting in October 2018, for a total of 612,343 articles (Rao & Taboada, 2021).
Methods
To analyse the data, the authors first identified the people mentioned and the people quoted in each article. The gender of each person was determined from a “cache of commonly quoted public figures” and an API that stores gender based on the person’s full or first name. The people quoted were a subset of the people mentioned, and a person quoted several times in the same article was counted only once. The authors acknowledged that this process limits gender to a binary and cannot take other genders into account.
For topic modelling, the authors used Apache Spark’s parallel Latent Dirichlet Allocation in Python. The preprocessing steps were tokenisation, normalisation, lowercasing, stopword removal and lemmatisation. Tokenisation and normalisation removed symbols and artifacts that were not required. Stopword removal dropped common words that are not useful for interpretation. Lemmatisation checked whether each token was present in a dictionary of lemmas. The authors also used two language analyses, keyness and dependency bigrams (Rao & Taboada, 2021).
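To make the preprocessing steps more concrete, here is a minimal Python sketch of tokenisation, lowercasing, stopword removal and dictionary-based lemmatisation. It is not the authors’ code: the stopword list, the lemma dictionary, the function names and the example sentence are all made up for illustration, and a real project would use much larger resources.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "is", "are"}
LEMMAS = {"women": "woman", "men": "man", "quoted": "quote", "articles": "article"}


def tokenise(text):
    """Lowercase the text and keep only alphabetic tokens (drops symbols and digits)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop common words that carry little meaning for interpretation."""
    return [token for token in tokens if token not in STOPWORDS]


def lemmatise(tokens):
    """Replace a token with its lemma if it is in the dictionary, else keep it."""
    return [LEMMAS.get(token, token) for token in tokens]


def preprocess(text):
    """Run the three steps in order and return the cleaned tokens."""
    return lemmatise(remove_stopwords(tokenise(text)))


example = "The women were quoted in 12 articles & the men in 30!"
print(preprocess(example))
```

The cleaned tokens are what a topic model such as LDA would then take as input.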
Findings
In general, women were quoted more often in the Lifestyle, Entertainment, Arts and Healthcare categories, whereas men were quoted more often in Business, Sports and United States Politics. Men were more likely to be quoted in high-profile articles and women in low-profile articles. Two aspects of the study that I found interesting were that the authors looked into the type of information reported for men and women, and that they looked at how the topics reported corresponded to particular events. For men’s sports, the events were described, whereas for women’s sports the focus was on achievements. In the business section, the male corpus had bigrams about stocks and trading, whereas the female corpus had bigrams about small transactions, shopping and small or local businesses. Finally, information about women’s and trans rights tended to be in the news more in February and March (in both 2019 and 2020), which corresponds to International Women’s Day in March (Rao & Taboada, 2021).
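To get a feel for how bigrams from two corpora can be compared, here is a small Python sketch that scores each bigram with a log-likelihood keyness statistic. This is only one common keyness measure, and I am assuming it here for illustration; it may not be the exact measure the authors used. The two toy corpora and the function names are invented, and plain adjacent-word bigrams stand in for the dependency bigrams in the paper.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs (not dependency bigrams, which need a parser)."""
    return list(zip(tokens, tokens[1:]))


def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one item observed in two corpora."""
    combined = count_a + count_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def keyness_table(counts_a, counts_b):
    """Score every bigram seen in either corpus, highest keyness first."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {
        bigram: log_likelihood(counts_a[bigram], total_a, counts_b[bigram], total_b)
        for bigram in set(counts_a) | set(counts_b)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy corpora, for illustration only.
corpus_a = "stock market trading stock market shares".split()
corpus_b = "local business shopping local business owner".split()

table = keyness_table(Counter(bigrams(corpus_a)), Counter(bigrams(corpus_b)))
for bigram, score in table[:4]:
    print(bigram, round(score, 2))
```

A high score only says that a bigram is unusually frequent in one corpus compared with the other; which corpus it belongs to still has to be read from the raw counts.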
Topic ideas
- I found Rao and Taboada’s research on how often women and men are quoted interesting, and one research idea I had was to apply similar methods to American or Indian newspapers. Perhaps the topics researched would be limited to two categories, such as sports and entertainment.
- A related idea would be to compare how gender is discussed in two newspapers, such as the New York Times and the Los Angeles Times.
- A second idea would be to observe the type of language used in advertisements for different genders in either the United States or India.
- A third idea would be to look at op-ed pages and observe which gender is featured more, or whether each gender tends to discuss particular topics.
- The sources would be newspaper articles. This is an online archive I found that has American newspapers: https://guides.loc.gov/consumer-advertising-great-depression/databases-and-archives
- In India, the Times of India and The Hindu are popular newspapers, and I found online archives for them as well: https://timesofindia.indiatimes.com/archive.cms and https://www.thehindu.com/archive/
Doubts
In the Indian archives, the articles are not always separated by topic. Therefore, it would be difficult to do an analysis of the discussion of gender based on the newspaper topic.
Some of the archives are not free to use, so I am not sure how to collect the data.
Some of the sites can detect bots. If so, will this create issues if we use web scraping?
I am still not sure which text-as-data method I would apply for my research topic ideas.
One of the articles I read discussed having a pipeline, but I did not understand what that meant.
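As far as I can tell, in data work a pipeline usually means a fixed sequence of processing steps in which the output of one step becomes the input of the next. The Python sketch below shows that general idea only; the step functions and the `run_pipeline` helper are hypothetical names I made up, and the article may have meant something more specific.

```python
from functools import reduce


def run_pipeline(data, steps):
    """Pass data through each step in order; each output feeds the next step."""
    return reduce(lambda value, step: step(value), steps, data)


def strip_whitespace(text):
    """Step 1: remove leading and trailing spaces."""
    return text.strip()


def lowercase(text):
    """Step 2: convert the text to lowercase."""
    return text.lower()


def split_words(text):
    """Step 3: split the text into a list of words."""
    return text.split()


steps = [strip_whitespace, lowercase, split_words]
print(run_pipeline("  Women Were Quoted  ", steps))
```

Because the steps are just a list, a step can be added, removed or reordered without rewriting the rest of the code.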
References
- Rao, P., & Taboada, M. (2021). Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework. Frontiers in Artificial Intelligence, 4, 664737. https://doi.org/10.3389/frai.2021.664737
- Devinney, H., Björklund, J., & Björklund, H. (2020). Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish. Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, 79–92. https://aclanthology.org/2020.gebnlp-1.8